Take-home Exercise 3

Explore different perspectives and approaches to create more truthful and enlightening data visualisation

Huan Li https://linkedin.com/in/huan-li-ab7498124/ (SMU, SCIS, Master of IT in Business)https://scis.smu.edu.sg/master-it-business/about-mitb-main
2022-05-16

1. Overview

Anticipating rapid growth, the city of Engagement, Ohio USA is doing a participatory urban planning exercise to understand the current state of the city and identify opportunities for future growth. About 1000 representative residents in this modest-sized city have agreed to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things. From these volunteers, the city will have data to assist with their major community revitalization efforts, including how to allocate a very large city renewal grant they have recently received.

Economic considers the financial health of the city. How does the financial health of the residents change over the period covered by the dataset? How do wages compare to the overall cost of living in Engagement? Are there groups that appear to exhibit similar patterns?

In this exercise, we will explore different perspectives and approaches to create more enlightening data visualisation on dataset VAST Challenge 2022. The operation was carried out on Rstudio and main packages used are tidyverse and ggplot2 extensions.

2. Data Preparation

2.1 Installing and loading the required libraries

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

The chunk code on the right will do the trick.

hide
packages = c('tidyverse', 'knitr', 'ggdist', 'ggridges',
             'scales', 'grid', 'gridExtra','plotly',
             'ggrepel', 'formattable', 'patchwork',
             'ggiraph', 'lubridate', 'data.table',
             'ggthemes','gganimate','gifski','gapminder')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

2.2 Importing the dataset

The code chunk below imports participants.csv and FinancialJournal.csv into R environment using read_csv() function of readr package.

hide
participants <- read_csv('data/Participants.csv')
financial <- read_csv('data/FinancialJournal.csv')

It is always a good practice to examine the imported data frame before further analysis is performed.

For example, kable() can be used to review the structure of the imported data frame.

Let’s take an overview of the datasets.

hide
kable(head(participants))
participantId householdSize haveKids age educationLevel interestGroup joviality
0 3 TRUE 36 HighSchoolOrCollege H 0.0016267
1 3 TRUE 25 HighSchoolOrCollege B 0.3280865
2 3 TRUE 35 HighSchoolOrCollege A 0.3934696
3 3 TRUE 21 HighSchoolOrCollege I 0.1380634
4 3 TRUE 43 Bachelors H 0.8573967
5 3 TRUE 32 HighSchoolOrCollege D 0.7729578
hide
kable(head(financial))
participantId timestamp category amount
0 2022-03-01 Wage 2472.50756
0 2022-03-01 Shelter -554.98862
0 2022-03-01 Education -38.00538
1 2022-03-01 Wage 2046.56221
1 2022-03-01 Shelter -554.98862
1 2022-03-01 Education -38.00538

2.3 Data Wrangling

In order to understand the financial health of the residents change over the period, we need to derive income, overall cost and balance of residents in a monthly basis.

2.3.1 Dealling with time interval

Monthly income/cost for residents need to be derived to view the change over recorded 15 months. Code chunk below shows how we change time format to monthly basis.

Switch to monthly basis

hide
monthlyFinancial <- financial %>% 
  mutate(yearmonth = format(as.Date(timestamp), "%Y.%m")) %>% 
  select(-timestamp)
monthlyFinancial
# A tibble: 1,856,330 x 4
   participantId category  amount yearmonth
           <dbl> <chr>      <dbl> <chr>    
 1             0 Wage      2473.  2022.03  
 2             0 Shelter   -555.  2022.03  
 3             0 Education  -38.0 2022.03  
 4             1 Wage      2047.  2022.03  
 5             1 Shelter   -555.  2022.03  
 6             1 Education  -38.0 2022.03  
 7             2 Wage      2437.  2022.03  
 8             2 Shelter   -557.  2022.03  
 9             2 Education  -12.8 2022.03  
10             3 Wage      2367.  2022.03  
# ... with 1,856,320 more rows

Convert 15 months into serial numbers

hide
mon_convert <- function(y, m){mon = 12*(y-2022)+m-2}
print(mon_convert(2023,5))
[1] 15
hide
MonthlyFinancial <- year_month %>%
  mutate(SerialMonth = mon_convert(year_month$y, year_month$m))
summary(MonthlyFinancial$SerialMonth)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   4.000   8.000   7.732  11.000  15.000 

2.3.2 Pivot Dataframe

Firstly, we need to use group_by to group individual expense and income category. And then use summarise function to summarize each category.

hide
summarizedFinancial <- MonthlyFinancial %>% 
  group_by(participantId, category, SerialMonth, yearmonth) %>% 
  summarise(monthly_financial = sum(amount))
summarizedFinancial
# A tibble: 55,498 x 5
# Groups:   participantId, category, SerialMonth [55,498]
   participantId category  SerialMonth yearmonth monthly_financial
           <dbl> <chr>           <dbl> <chr>                 <dbl>
 1             0 Education           1 2022.03               -76.0
 2             0 Education           2 2022.04               -38.0
 3             0 Education           3 2022.05               -38.0
 4             0 Education           4 2022.06               -38.0
 5             0 Education           5 2022.07               -38.0
 6             0 Education           6 2022.08               -38.0
 7             0 Education           7 2022.09               -38.0
 8             0 Education           8 2022.10               -38.0
 9             0 Education           9 2022.11               -38.0
10             0 Education          10 2022.12               -38.0
# ... with 55,488 more rows

Then, the dataframe need to be pivoted using code chunk below.

hide
Financial <- summarizedFinancial %>% 
  pivot_wider(names_from = category, values_from = monthly_financial)
Financial[is.na(Financial)] = 0
Financial
# A tibble: 13,331 x 9
# Groups:   participantId, SerialMonth [13,331]
   participantId SerialMonth yearmonth Education  Food Recreation
           <dbl>       <dbl> <chr>         <dbl> <dbl>      <dbl>
 1             0           1 2022.03       -76.0 -268.     -349. 
 2             0           2 2022.04       -38.0 -266.     -219. 
 3             0           3 2022.05       -38.0 -265.     -383. 
 4             0           4 2022.06       -38.0 -257.     -466. 
 5             0           5 2022.07       -38.0 -270.    -1070. 
 6             0           6 2022.08       -38.0 -262.     -314. 
 7             0           7 2022.09       -38.0 -256.     -295. 
 8             0           8 2022.10       -38.0 -267.      -25.0
 9             0           9 2022.11       -38.0 -261.     -377. 
10             0          10 2022.12       -38.0 -266.     -357. 
# ... with 13,321 more rows, and 3 more variables: Shelter <dbl>,
#   Wage <dbl>, RentAdjustment <dbl>

2.3.3 Deriving monthly-income, cost-of-ling and monthly-balance

To show the change of financial situation of residents during this 15 months,we need to calculate the monthly income , monthly living cost as well as monthly balance and then used them to visualize in the next part.

Residents’ monthly income is derived by calculating the sum of wage.

Cost of living is made up of expenses from education, food, recreation, shelter, and offset the rent adjustment. Residents’ monthly cost of living is derived by calculating the sum of above expenses.

hide
FINANCIAL <- Financial %>% 
  mutate(monthly_cost = Education + Food + Recreation
         + Shelter + RentAdjustment) %>% 
  mutate(monthly_income = Wage) %>% 
  mutate(monthly_balance = monthly_income + monthly_cost)
FINANCIAL
# A tibble: 13,331 x 12
# Groups:   participantId, SerialMonth [13,331]
   participantId SerialMonth yearmonth Education  Food Recreation
           <dbl>       <dbl> <chr>         <dbl> <dbl>      <dbl>
 1             0           1 2022.03       -76.0 -268.     -349. 
 2             0           2 2022.04       -38.0 -266.     -219. 
 3             0           3 2022.05       -38.0 -265.     -383. 
 4             0           4 2022.06       -38.0 -257.     -466. 
 5             0           5 2022.07       -38.0 -270.    -1070. 
 6             0           6 2022.08       -38.0 -262.     -314. 
 7             0           7 2022.09       -38.0 -256.     -295. 
 8             0           8 2022.10       -38.0 -267.      -25.0
 9             0           9 2022.11       -38.0 -261.     -377. 
10             0          10 2022.12       -38.0 -266.     -357. 
# ... with 13,321 more rows, and 6 more variables: Shelter <dbl>,
#   Wage <dbl>, RentAdjustment <dbl>, monthly_cost <dbl>,
#   monthly_income <dbl>, monthly_balance <dbl>

2.3.4 Join tables

In order to show the income and consumption patterns in different groups, we will combine FINANCIAL and participants dataframe together.

hide
combine <- FINANCIAL %>% 
  left_join(participants, by = "participantId")
combine
# A tibble: 13,331 x 18
# Groups:   participantId, SerialMonth [13,331]
   participantId SerialMonth yearmonth Education  Food Recreation
           <dbl>       <dbl> <chr>         <dbl> <dbl>      <dbl>
 1             0           1 2022.03       -76.0 -268.     -349. 
 2             0           2 2022.04       -38.0 -266.     -219. 
 3             0           3 2022.05       -38.0 -265.     -383. 
 4             0           4 2022.06       -38.0 -257.     -466. 
 5             0           5 2022.07       -38.0 -270.    -1070. 
 6             0           6 2022.08       -38.0 -262.     -314. 
 7             0           7 2022.09       -38.0 -256.     -295. 
 8             0           8 2022.10       -38.0 -267.      -25.0
 9             0           9 2022.11       -38.0 -261.     -377. 
10             0          10 2022.12       -38.0 -266.     -357. 
# ... with 13,321 more rows, and 12 more variables: Shelter <dbl>,
#   Wage <dbl>, RentAdjustment <dbl>, monthly_cost <dbl>,
#   monthly_income <dbl>, monthly_balance <dbl>, householdSize <dbl>,
#   haveKids <lgl>, age <dbl>, educationLevel <chr>,
#   interestGroup <chr>, joviality <dbl>

3. Visulisations and Insights

3.1 How does the financial health of the residents change over the period?

To visualise the financial change during this 15 months, we will use ridge plot to show the economic situation

hide
p1 <- ggplot(combine, 
             aes(x=Wage, 
                 y=combine$yearmonth, 
                 fill = factor(stat(quantile)))) +
  stat_density_ridges(geom = "density_ridges_gradient", 
                      calc_ecdf = TRUE,
                      quantiles = 4, 
                      quantile_lines = TRUE) +
  scale_fill_viridis_d(name = "Quartiles") +
  labs(x= "Wage",
       y= "Time",
       title="Distribution of Residents' Wage")
p1

According to above ridge plot, we can know that residents’ wage in 2022 March is higher than the following 14 months.

hide
p2 <- ggplot(combine,
            aes(x = combine$monthly_balance,
                y = combine$yearmonth))+
  geom_density_ridges(jittered_points = TRUE,
                      position = position_points_jitter(width = 0.05, 
                                                        height = 0),
                      point_shape = '|', 
                      point_size = 3, 
                      point_alpha = 1, 
                      alpha = 0.7,) +
  stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
  geom_density_ridges_gradient(scale = 2, rel_min_height = 0.01) +
  scale_fill_viridis_c(name = "Monthly Balance", direction = -1) +
  geom_vline(aes(xintercept=mean(combine$monthly_balance, na.rm=T)),
             color="red", 
             linetype="dashed", 
             size=0.5) +
  geom_vline(aes(xintercept=median(combine$monthly_balance, na.rm=T)),
             color="blue",
             linetype="dashed", 
             size=0.5) +
  theme(axis.title.y=element_text(angle=0),
      axis.line = element_line(color='grey'), 
      plot.title = element_text(hjust = 0.5),
      axis.title.y.left = element_text(vjust = 0.5,), 
      axis.text = element_text(face="bold")) +
  labs(x= "Monthly Balance",
       y= "Time",
       title="Distribution of Residents' Monthly Balance")

p2

Accordingly, the distribution of monthly balance is in a similar pattern. And the dot plot delow the ridge plot shows that people with higher balance is affected even more serious.

3.2 How do wages compare to the overall cost of living in Engagement?

hide
plot_ly(data = combine,
        x = ~Wage,
        y = abs(combine$monthly_cost),
        text = ~paste("Period:", yearmonth,
                      "<br>Balance:", monthly_balance),
        color = ~yearmonth) %>%
  layout(title = 'Monthly Wage versus Monthly Living Cost',
         ylabel = 'Monthly Cost of Living')

Residents’ wages are more widely distributed in the early stage and perform in a similar pattern after the 1st month.

3.3 Are there groups that appear to exhibit similar patterns?

3.3.1 Financial patterns in different education levels

Distribution of monthly balance versus education level

First, Let’s have a general overview of the distribution of monthly balance with different education level.

hide
p <- ggplot(data = combine,
            aes(x=monthly_balance,
                color = educationLevel)) +
  geom_density() +
  labs(x= "Monthly Balance",
       y= "Density",
       title="Density distribution of Residents' Monthly Balance",
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(panel.background = element_blank(),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        axis.title.y = element_text(angle=0,vjust = 0.5),
        axis.ticks.x = element_blank(),
        axis.line= element_line(color= 'grey'),
        axis.text.x = element_text(size=8,angle=0),
        panel.grid.major.y = element_line(color= 'grey', size = 0.1),
        plot.caption = element_text(hjust=0),
        legend.key = element_rect(fill= NA))
ggplotly(p)

The density plot shows that there are more people from higher the education level for the higher monthly balance.

Below, ggdist package is used to plot raincloud plots so as to show more details of the distribution of monthly balance in diffrent education level.

hide
ggplot(combine, aes(x = educationLevel, 
                    y = monthly_balance)) +
  scale_y_continuous(breaks = seq(0, 18000, 3000), 
                     limits = c(0, 18000)) +
  stat_halfeye(adjust = 0.5, 
               width = .66, 
               color = NA,
               justification = -.01,
               position = position_nudge(x = .15)) + 
  geom_boxplot(width = .20,
               outlier.shape = NA) +
  stat_summary(geom = "point",
               fun.y = "mean",
               colour = "red",
               size = 1) +
  stat_dots(side = "left",
            justification = 1.3, 
            binwidth = .25,
            dotsize = 5) +
  scale_color_manual(name= 'Statistics',
                   values = (Mean= '#f6546a')) +
  labs(title = 'Monthly Balance in Groups with Diffrent Education Levels', 
       x = 'Education Level', 
       y = 'Monthly Balance',
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  coord_flip()

A combination of raincloud plot, boxplot and dot plot are used here. It further clarify our belief that education level means a lot to the wealth accumulation.

Distribution of monthly income versus cost in different education levels

The code chunk below shows the change of monthly cost versus monthly income during this 15 months in an iteratively way.

hide
ggplot(combine, aes(x = monthly_income, 
                    y = abs(monthly_cost),
                    size = Wage,
                    colour = combine$educationLevel)) +
  geom_point(alpha = 0.5, 
             show.legend = TRUE) +
  labs(title = 'Month: {frame_time}', 
       x = 'Monthly Cost', 
       y = 'Monthly Income') +
  transition_time(as.integer(SerialMonth)) +
  ease_aes('linear')

In the beginning, the change of income and cost changed quickly, then it leveled off.

3.3.2 Financial patterns in different interest group

Distribution of Monthly Income of Residents from Diffrent Interest Group

hide
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean Income per Month:", mean, "+/-", sem)
} 
p <- ggplot(data=combine,
            aes(x = interestGroup),) +
  stat_summary(aes(y = monthly_income, 
                   tooltip = after_stat(tooltip(y, ymax))),
               fun.data = "mean_se",
               geom = GeomInteractiveCol,
               fill = "light blue") +
  stat_summary(aes(y = monthly_income),
               fun.data = mean_se,
               geom = "errorbar", width = 0.2, size = 0.2) +
  labs(title = 'Monthly Income of Residents from Diffrent Interest Group', 
       x = 'Interest Group', 
       y = 'Monthly Income',
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

girafe(ggobj = p,
       width_svg = 8,
       height_svg = 8*0.618)

Table above shows that the average wage of residents having interest of D is the highest and that of F is the lowest.

Distribution of Monthly Cost of Residents from Diffrent Interest Group

hide
p <- ggplot(combine, aes(x = interestGroup, 
                    y = abs(monthly_cost))) +
  scale_y_continuous(breaks = seq(0, 5000, 1000), 
                     limits = c(0, 5000)) +
  stat_halfeye(adjust = 0.5, 
               width = .66, 
               color = NA,
               justification = -.01,
               position = position_nudge(x = .15)) + 
  geom_boxplot(width = .20,
               outlier.shape = NA) +
  stat_summary(geom = "point",
               fun.y = "mean",
               colour = "red",
               size = 1) +
  stat_dots(side = "left",
            justification = 1.3, 
            binwidth = .25,
            dotsize = 5) +
  scale_color_manual(name= 'Statistics',
                   values = (Mean= '#f6546a')) +
  labs(title = 'Monthly Cost of Residents from Diffrent Interest Group', 
       x = 'Interest Group', 
       y = 'Monthly Cost',
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))+
  coord_flip()
p
hide
ggplotly(p)

While the mean income of those from interest group F is the lowest, it’s atotally a different case for the mean of the cost. The average cost of those from interest group F is the highest.

4. Conclusions